- Path: locutus.rchland.ibm.com!usenet
- From: Philip Staite <pstaite+@rchland.ibm.com>
- Newsgroups: comp.lang.c++
- Subject: Re: Special sort algorithm
- Date: Fri, 29 Mar 1996 09:21:22 -0600
- Organization: IBM Rochester, MN
- Message-ID: <315BFFF2.167E@rchland.ibm.com>
- References: <31593E09.29A0@kmd.dk>
- NNTP-Posting-Host: powertool.rchland.ibm.com
- Mime-Version: 1.0
- Content-Type: text/plain; charset=us-ascii
- Content-Transfer-Encoding: 7bit
- X-Mailer: Mozilla 2.01 (X11; I; AIX 1)
-
- Carsten Bonde wrote:
- >
- > Hi
- >
- > This is not strictly related to C++, but I guess some of you may have a hint
- > on this anyway.
- >
- > I am trying to make a DOS file sorting program, which is to be part of a commercial product.
- > It must :
- > - Sort big files of short lines
- > lines are uneven length
- > files are XX MB
- > lines are < 100 chars
- > - Put resulting file in original file
- > - Run in very little memory ( < 50 K)
- > - Use about no extra disk space ( < 10% extra)
- > - Run reasonably fast
- >
- > I have tried implementing in-file bubblesort and others with the language
- > being C++. But they are way too slow, even with some optimizing buffers.
- >
- > Does anyone have a suggestion for a good algorithm, or maybe even source.
-
- Try this approach. Although it won't be blindingly fast, it should get
- the job done... Note, if you could relax the disk space requirements to
- allow for 2x disk storage (i.e. 100% extra), then you could use a
- conventional external merge sort and it'd run pretty fast. Come on,
- disk space is cheap, an extra 50M of disk could improve speed by an
- order of magnitude here...
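-
- (For reference, here is a rough sketch of that external merge sort in
- C++. The run size, file names, and the use of std::sort and a
- priority_queue are illustrative choices, not part of the original
- suggestion; treat it as an outline rather than drop-in code.)
-
-   #include <algorithm>
-   #include <fstream>
-   #include <queue>
-   #include <string>
-   #include <vector>
-
-   // Pass 1: cut the input into sorted "runs" of roughly runBytes each.
-   std::vector<std::string> makeRuns(const char* inName, std::size_t runBytes)
-   {
-       std::ifstream in(inName);
-       std::vector<std::string> runNames, lines;
-       std::string line;
-       std::size_t used = 0;
-
-       auto flush = [&]() {
-           if (lines.empty()) return;
-           std::sort(lines.begin(), lines.end());
-           std::string name = "run" + std::to_string(runNames.size()) + ".tmp";
-           std::ofstream out(name);
-           for (const std::string& l : lines) out << l << '\n';
-           runNames.push_back(name);
-           lines.clear();
-           used = 0;
-       };
-
-       while (std::getline(in, line)) {
-           used += line.size() + 1;
-           lines.push_back(line);
-           if (used >= runBytes) flush();
-       }
-       flush();
-       return runNames;
-   }
-
-   // Pass 2: k-way merge of the sorted runs into the output file.
-   void mergeRuns(const std::vector<std::string>& runNames, const char* outName)
-   {
-       struct Head { std::string line; std::size_t which; };
-       auto gt = [](const Head& a, const Head& b) { return a.line > b.line; };
-       std::priority_queue<Head, std::vector<Head>, decltype(gt)> heap(gt);
-
-       std::vector<std::ifstream> runs;
-       for (const std::string& n : runNames) runs.emplace_back(n);
-
-       std::string line;
-       for (std::size_t i = 0; i < runs.size(); ++i)
-           if (std::getline(runs[i], line)) heap.push(Head{line, i});
-
-       std::ofstream out(outName);
-       while (!heap.empty()) {
-           Head h = heap.top();
-           heap.pop();
-           out << h.line << '\n';
-           if (std::getline(runs[h.which], line)) heap.push(Head{line, h.which});
-       }
-   }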
-
- Anyway, consider this algorithm:
-
- 1) Break up the initial file into < 20K sections. To do this, stat() or
- fstat() the file for total length. Subtract 20K from that, seek to
- that position, then move forward to the first newline. Remember this
- position. Copy the remainder of the file to a temp disk file. Use
- truncate() or ftruncate() to chop the original file's size (lopping off
- the chunk just saved to the temp file). Repeat this until the original
- file has been diced into N files of 20K or less. This shouldn't require
- more than 20K over the initial file size. (A rough sketch of this step
- appears after the step list.)
-
- 2) Block bubble sort. Read two of the temp files into a 40K buffer.
- Perform a quicksort on this buffer. (see qsort) There should be
- approximately 400 lines in the buffer.
-
- 3) Keep the first 200 lines in the buffer, write the "last" 200 (or
- fewer) back to the last file read.
-
- 4) Read the "next" file into the slots in the 40K buffer just flushed to
- the file.
-
- 5) Perform quicksort again.
-
- 6) Repeat steps 3, 4, and 5 until there is no "next" file.
-
- 7) Now, these first 200 lines are guaranteed to be the first 200 lines
- in the whole data set. Write these to the original file.
-
- 8) Copy the "last" temp file to the first, delete the last.
-
- 9) Repeat steps 2-6 to get the next 200 lines in the file.
-
- 10) Append these 200 lines to the original file.
-
- 11) Repeat steps 8-10 until there are no more temp files.
-
-
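- A rough sketch of step 1 in C++ (it uses std::filesystem::resize_file
- in place of truncate()/ftruncate(), and the chunk file names are made
- up for illustration):
-
-   #include <cstdint>
-   #include <filesystem>
-   #include <fstream>
-   #include <string>
-   #include <vector>
-
-   namespace fs = std::filesystem;
-
-   // Dice 'orig' into tail chunks of at most ~chunkBytes, cutting only
-   // at line boundaries. Each chunk is lopped off the *end* of the file,
-   // so at most one chunk of extra disk space is in use at any time.
-   std::vector<std::string> splitFile(const std::string& orig,
-                                      std::uintmax_t chunkBytes)
-   {
-       std::vector<std::string> chunks;
-       std::uintmax_t size = fs::file_size(orig);
-
-       while (size > chunkBytes) {
-           std::uintmax_t cut = size - chunkBytes;   // ~20K from the end
-
-           std::ifstream in(orig, std::ios::binary);
-           in.seekg(static_cast<std::streamoff>(cut));
-           std::string skipped;
-           std::getline(in, skipped);                // forward to next newline
-           std::uintmax_t start =
-               static_cast<std::uintmax_t>(in.tellg());
-
-           // Copy the tail [start, size) to a temp chunk file.
-           std::string name = "chunk" + std::to_string(chunks.size()) + ".tmp";
-           std::ofstream out(name, std::ios::binary);
-           out << in.rdbuf();
-           chunks.push_back(name);
-           in.close();
-
-           fs::resize_file(orig, start);             // chop the tail off
-           size = start;
-       }
-       chunks.push_back(orig);   // the <= 20K remainder is the last piece
-       return chunks;
-   }
-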
- What this is really doing is partitioning the original file into 20K
- segments, then performing a bubble sort over the segments, where two at
- a time are brought into memory to compare. (an external bubble sort,
- ugh...)
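-
- One pass of steps 2-7 might look something like this in C++ (std::sort
- stands in for qsort, lines are held as std::strings rather than a raw
- 40K buffer, and the file shuffling of steps 8-11 is left to the
- caller):
-
-   #include <algorithm>
-   #include <fstream>
-   #include <string>
-   #include <vector>
-
-   // After one pass, the globally smallest lines have been appended to
-   // outName, and chunks[1..] hold their "larger" leftovers for the
-   // next pass.
-   void onePass(const std::vector<std::string>& chunks,
-                const std::string& outName)
-   {
-       std::vector<std::string> held;                 // the lines we keep
-
-       {   // prime the buffer with the first chunk
-           std::ifstream in(chunks[0]);
-           std::string line;
-           while (std::getline(in, line)) held.push_back(line);
-       }
-
-       for (std::size_t i = 1; i < chunks.size(); ++i) {
-           std::vector<std::string> buf = held;       // steps 2 and 4
-           {
-               std::ifstream in(chunks[i]);
-               std::string line;
-               while (std::getline(in, line)) buf.push_back(line);
-           }
-
-           std::sort(buf.begin(), buf.end());         // step 5
-
-           // Step 3: keep the smaller part, write the larger part back
-           // to the chunk we just read.
-           std::size_t keep = std::min(held.size(), buf.size());
-           held.assign(buf.begin(), buf.begin() + keep);
-           std::ofstream out(chunks[i], std::ios::trunc);
-           for (std::size_t j = keep; j < buf.size(); ++j)
-               out << buf[j] << '\n';
-       }
-
-       // Step 7: 'held' is now guaranteed smallest overall; append it.
-       std::ofstream out(outName, std::ios::app);
-       for (const std::string& l : held) out << l << '\n';
-
-       // Step 8 (not shown): chunks[0] is now stale (its lines live in
-       // outName and the other chunk files), so copy the last chunk
-       // over it and delete the last chunk before the next pass.
-   }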
-
- Potential problems with this....
-
- When you run qsort it may pick a really bad pivot element. After the
- first pass, each "half" of the buffer is partially sorted. Therefore if
- qsort arbitrarily picks the middle element it'll actually get a pivot
- that is offset well into the sorted data. You may want to adjust your
- sizes and maintain a 15K "held" buffer and a 30K working buffer. This
- way you would only have to do an initial qsort on the unsorted parts
- (say, as you partitioned the original file). Thereafter, you would just
- perform an internal merge of the "held" 15K and the 15K coming in from
- a temp file.
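-
- That internal merge could look something like this (assuming the "held"
- lines and the incoming chunk are each already sorted, and again using
- std::strings rather than raw buffers):
-
-   #include <algorithm>
-   #include <iterator>
-   #include <string>
-   #include <vector>
-
-   // Merge the incoming chunk into the held lines, keep the smaller
-   // part, and hand the larger part back to be written to the chunk
-   // file. Linear time, no repeated qsort over mostly-sorted data.
-   void mergeStep(std::vector<std::string>& held,
-                  const std::vector<std::string>& incoming,
-                  std::vector<std::string>& writeBack)
-   {
-       std::vector<std::string> merged;
-       merged.reserve(held.size() + incoming.size());
-       std::merge(held.begin(), held.end(),
-                  incoming.begin(), incoming.end(),
-                  std::back_inserter(merged));
-
-       std::size_t keep = held.size();                // keep the same amount
-       writeBack.assign(merged.begin() + keep, merged.end());
-       merged.resize(keep);
-       held.swap(merged);
-   }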
-
- This may generate a LOT of files. If the original file is 50M and you
- partition it into 20K files, that is 2500 files. DOS is ill-equipped to
- deal with that many files. That is 80K of directory entries alone on
- disk (2500 entries at 32 bytes each). Note, DOS doesn't recover
- directory space when the files are deleted, so you'll want to make sure
- you do this in a temp directory, then remove the directory to recover
- that 80K...
-
-
- --
-
- Phil Staite, (507) 253-2529, team OS/2
- internet: pstaite@vnet.ibm.com internal: pstaite@rchland
-